Abstract

This work targets human action recognition in video. While recent methods typically
represent actions by statistics of local video features, here we argue for the
importance of a representation derived from human pose. To this end we propose a new
Pose-based Convolutional Neural Network descriptor (P-CNN) for action recognition. The
descriptor aggregates motion and appearance information along tracks of human body
parts. We investigate different schemes of temporal aggregation and experiment with
P-CNN features obtained both for automatically estimated and manually annotated human
poses. We evaluate our method on the recent and challenging JHMDB and MPII Cooking
datasets. For both datasets our method shows consistent improvement over the state of
the art.


1. Introduction 

Recognition of human actions is an important step toward fully automatic understanding
of dynamic scenes. Despite significant progress in recent years, action recognition
remains a difficult challenge. Common problems stem from the strong variations of
people and scenes in motion and appearance. Other factors include subtle differences
of fine-grained actions, for example when manipulating small objects or assessing the
quality of sports actions.

The majority of recent methods recognize actions based on statistical representations
of local motion descriptors [22, 33, 41]. These approaches are very successful in
recognizing coarse actions (standing up, hand-shaking, dancing) in challenging scenes
with camera motion, occlusions, multiple people, etc. Global approaches, however, lack
structure and may not be optimal for recognizing subtle variations, e.g. distinguishing
correct from incorrect golf swings or recognizing the fine-grained cooking actions
illustrated in Figure 2.

Fine-grained recognition in static images highlights the
importance of spatial structure and spatial alignment as a 
pre-processing step. Examples include alignment of faces 
for face recognition lO as well as alignment of body parts 
for recognizing species of birds O. In analogy to this 
prior work, we believe action recognition will benefit from 
the spatial and temporal detection and alignment of human 
poses in videos. In fine-grained action recognition, this will, 
for example, allow to better differentiate wash hands from 
wash object actions. 

In this work we design a new action descriptor based on human poses. Provided with
tracks of body joints over time, our descriptor combines motion and appearance features
for body parts. Given the recent success of Convolutional Neural Networks (CNN)
[20, 23], we explore CNN features obtained separately for each body part in each frame.
We use appearance and motion-based CNN features computed for each track of body parts,
and investigate different schemes of temporal aggregation. The extraction of the
proposed Pose-based Convolutional Neural Network (P-CNN) features is illustrated in
Figure 1.

Pose estimation in natural images is still a difficult task [7, 32, 42]. In this paper
we investigate P-CNN features both for automatically estimated and for manually
annotated human poses. We report experimental results for two challenging datasets:
JHMDB [19], a subset of HMDB [21] for which manual annotations of human pose have been
provided by [19], and MPII Cooking Activities [29], composed of a set of fine-grained
cooking actions. On both datasets our method consistently outperforms the human
pose-based descriptor HLPF [19]. Combination of our method with Dense trajectory
features [41] improves the state of the art for both datasets.

The rest of the paper is organized as follows. Related work is discussed in Section 2.
Section 3 introduces our P-CNN features. We summarize state-of-the-art methods used and
compared to in our experiments in Section 4 and present datasets in Section 5.
Section 6 evaluates our method and compares it to the state of the art. Section 7
concludes the
paper. Our implementation of P-CNN features is available 
from m. 

Figure 1: P-CNN features. From left to right: Input video. Human pose and corresponding
human body parts for one frame of the video. Patches of appearance (RGB) and optical
flow for human body parts. One RGB and one flow CNN descriptor $f_t^p$ is extracted per
frame $t$ and per part $p$ (an example is shown for the human body part right hand).
Static frame descriptors $f_t^p$ are aggregated over time using min and max to obtain
the video descriptor $v_{stat}^p$. Similarly, temporal differences of $f_t^p$ are
aggregated to $v_{dyn}^p$. Video descriptors are normalized and concatenated over parts
$p$ and aggregation schemes into appearance features $v_{app}$ and flow features
$v_{of}$. The final P-CNN feature is the concatenation of $v_{app}$ and $v_{of}$.


2. Related work 

Action recognition in the last decade has been dominated by local features
[22, 33, 41]. In particular, Dense Trajectory (DT) features BTl combined with Fisher
Vector (FV) aggregation EH have recently shown outstanding results for a number of
challenging benchmarks. We use IDT-FV [41] (improved version of DT with FV encoding) as
a strong baseline and experimentally demonstrate its complementarity to our method.

Recent advances in Convolutional Neural Networks (CNN) 12^ have resulted in significant
progress in image classification and other vision tasks [11, 36, 38]. In particular,
the transfer of pre-trained network parameters to problems with limited training data
has shown success e.g. in [17, 26, 34]. Application of CNNs to action recognition in
video, however, has shown only limited improvements so far O 133. We extend previous
global CNN methods and address action recognition using CNN descriptors at the local
level of human body parts.

Most of the recent methods for action recognition deploy global aggregation of local
video descriptors. Such representations provide invariance to numerous variations in
the video but may fail to capture important spatio-temporal structure. For fine-grained
action recognition, previous methods have represented person-object interactions by
joint tracking of hands and objects 1^ or by linking object proposals QD, followed by
feature pooling in selected regions. Alternative methods represent actions using
positions and temporal evolution of body joints. While reliable human pose estimation
is still a challenging task, the recent study [19] reports significant gains provided
by dynamic human pose features in cases when reliable pose estimation is available. We
extend the work 021 and design a new CNN-based representation for human actions
combining positions, appearance and motion of human body parts.

Our work also builds on methods for human pose estimation in images [23, 31, 38, 42]
and video sequences ISlI^. In particular, we build on the method [8] and extract
temporally-consistent tracks of body joints from video sequences. While our pose
estimator is imperfect, we use it to derive CNN-based pose features providing
significant improvements for action recognition on two challenging datasets.

3. P-CNN: Pose-based CNN features 

We believe that human pose is essential for action recognition. Here, we use positions
of body joints to define informative image regions. We further borrow inspiration from
oa and represent body regions with motion-based and appearance-based CNN descriptors.
Such descriptors are extracted at each frame and then aggregated over time to form a
video descriptor, see Figure 1 for an overview. The details are explained below.

To construct P-CNN features, we first compute optical flow [4] for each consecutive
pair of frames. The method [4] has relatively high speed, good accuracy and has been
recently used in other flow-based CNN approaches (T^l^ . Following CD, the values of
the motion field $v_x, v_y$ are transformed to the interval [0, 255] by
$\tilde{v}_{x|y} = a v_{x|y} + b$ with $a = 16$ and $b = 128$. The values below 0 and
above 255 are truncated. We save the transformed flow maps as images with three
channels corresponding to motion $v_x$, $v_y$ and the flow magnitude.
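
The flow encoding described in this paragraph can be sketched in a few lines. This is a minimal illustration rather than the implementation used in the paper: the function names are hypothetical, and applying the same affine map to the magnitude channel is an assumption, since the text gives the mapping only for the two flow components.

```python
import math

A, B = 16.0, 128.0  # scale and offset quoted in the text


def encode_component(v, a=A, b=B):
    """Map one flow value to [0, 255] via a*v + b, truncating out-of-range values."""
    return min(255.0, max(0.0, a * v + b))


def encode_flow_pixel(vx, vy):
    """Return the three channels stored for one pixel: vx, vy and magnitude.

    Assumption: the magnitude channel is passed through the same affine map.
    """
    magnitude = math.hypot(vx, vy)
    return (encode_component(vx), encode_component(vy), encode_component(magnitude))


if __name__ == "__main__":
    print(encode_flow_pixel(3.0, 4.0))
```

With a = 16 and b = 128, displacements between -8 and roughly +7.9 pixels per frame map inside [0, 255]; larger motions saturate.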


Given a video frame and the corresponding positions of body joints, we crop RGB image
patches and flow patches for right hand, left hand, upper body, full body and full
image as illustrated in Figure 1. Each patch is resized to 224 × 224 pixels to match
the CNN input layer. To represent appearance and motion patches, we use two distinct
CNNs with an architecture similar to 1^ . Both networks contain 5 convolutional and 3
fully-connected layers. The output of the second fully-connected layer with k = 4096
values is used as a frame descriptor ($f_t^p$). For RGB patches we use the publicly
available "VGG-f" network from O that has been pre-trained on the ImageNet ILSVRC-2012
challenge dataset mi. For flow patches, we use the motion network provided by CD that
has been pre-trained for the action recognition task on the UCF101 dataset [35].

Given descriptors $f_t^p$ for each part $p$ and each frame $t$ of the video, we then
proceed with the aggregation of $f_t^p$ over all frames to obtain a fixed-length video
descriptor. We consider min and max aggregation by computing minimum and maximum values
for each descriptor dimension $i$ over $T$ video frames

$$m_i = \min_{1 \le t \le T} f_t^p(i), \qquad M_i = \max_{1 \le t \le T} f_t^p(i). \tag{1}$$

The static video descriptor for part $p$ is defined by the concatenation of
time-aggregated frame descriptors as

$$v_{stat}^p = [m_1, \ldots, m_k, M_1, \ldots, M_k]^\top. \tag{2}$$

To capture temporal evolution of per-frame descriptors, we also consider temporal
differences of the form $\Delta f_t^p = f_{t+\Delta t}^p - f_t^p$ for $\Delta t = 4$
frames. Similar to (1) we compute minimum $\Delta m_i$ and maximum $\Delta M_i$
aggregations of $\Delta f_t^p$ and concatenate them into the dynamic video descriptor

$$v_{dyn}^p = [\Delta m_1, \ldots, \Delta m_k, \Delta M_1, \ldots, \Delta M_k]^\top. \tag{3}$$

Finally, video descriptors for motion and appearance for all parts and different
aggregation schemes are normalized and concatenated into the P-CNN feature vector. The
normalization is performed by dividing video descriptors by the average L2-norm of the
$f_t^p$ from the training set.
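
The min/max aggregation of frame descriptors and of their temporal differences can be sketched as follows. This is an illustrative pure-Python version under simple assumptions: each frame descriptor is a plain list of floats, all function names are hypothetical, and the video is assumed to have more than Δt frames.

```python
import math


def min_max_aggregate(frames):
    """Return [m_1..m_k, M_1..M_k] for a list of T descriptors of length k."""
    dims = list(zip(*frames))  # one tuple of T values per descriptor dimension
    return [min(d) for d in dims] + [max(d) for d in dims]


def temporal_differences(frames, dt=4):
    """Differences f_{t+dt} - f_t for every t where both frames exist."""
    return [[b - a for a, b in zip(frames[t], frames[t + dt])]
            for t in range(len(frames) - dt)]


def static_and_dynamic(frames, dt=4):
    """Static and dynamic video descriptors for one body part."""
    v_stat = min_max_aggregate(frames)
    v_dyn = min_max_aggregate(temporal_differences(frames, dt))
    return v_stat, v_dyn


def average_l2_norm(train_frames):
    """Average L2-norm of training frame descriptors, used as the normalizer."""
    return sum(math.sqrt(sum(x * x for x in f)) for f in train_frames) / len(train_frames)


def normalize(v, norm):
    """Divide a video descriptor by the precomputed average norm."""
    return [x / norm for x in v]
```

Keeping both the minimum and the maximum of the differences preserves the sign of the largest negative and positive changes, which a single max over absolute values would discard.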

In Section 6 we evaluate the effect of different aggregation schemes as well as the
contributions of motion and appearance features for action recognition. In particular,
we compare "Max" vs. "Max/Min" aggregation where "Max" corresponds to the use of $M_i$
values only while "Max/Min" stands for the concatenation of $M_i$ and $m_i$ defined in
(1) and (2). Mean and Max aggregation are widely used methods in CNN video
representations. We choose Max-aggr, as it outperforms Mean-aggr (see Section 6.2). We
also apply Min aggregation, which can be interpreted as a "non-detection feature".
Additionally, we want to follow the temporal evolution of CNN features in the video by
looking at their dynamics (Dyn). Dynamic features are again aggregated using Min and
Max to preserve their sign, keeping the largest negative and positive differences. The
concatenation of static and dynamic descriptors will be denoted by "Static+Dyn".

The final dimension of our P-CNN is (5 × 4 × 4K) × 2 = 160K, i.e., 5 body parts, 4
different aggregation schemes, 4K-dimensional CNN descriptor for appearance and motion.
Note that such a dimensionality is comparable to the size of Fisher vector 10 used to
encode dense trajectory features [41]. P-CNN training is performed using a linear SVM.
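
As a quick check of the dimensionality quoted above, the arithmetic can be spelled out. The variable names are illustrative, and "K" is read here as 1024 so that 4K equals the 4096 values of the second fully-connected layer.

```python
K = 1024                 # "K" as in "4K" and "160K"
parts = 5                # right hand, left hand, upper body, full body, full image
schemes = 4              # static min, static max, dynamic min, dynamic max
cnn_dim = 4 * K          # 4096-dimensional frame descriptor
modalities = 2           # appearance and flow

pcnn_dim = parts * schemes * cnn_dim * modalities
print(pcnn_dim, pcnn_dim // K)
```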

4. State-of-the-art methods 

In this section we present the state-of-the-art methods used and compared to in our
experiments. We first present the approach for human pose estimation in videos [8] used
in our experiments. We then present state-of-the-art high-level pose features (HLPF)
[19] and improved dense trajectories [41].

4.1. Pose estimation 

To compute P-CNN features as well as HLPF features, we need to detect and track human
poses in videos. We have implemented a video pose estimator based on [8]. We first
extract poses for individual frames using the state-of-the-art approach of Yang and
Ramanan [42]. Their approach is based on a deformable part model to locate positions of
body joints (head, elbow, wrist...). We re-train their model on the FLIC dataset [31].

Following [8], we extract a large set of pose configurations in each frame and link
them over time using Dynamic Programming (DP). The poses selected with DP are
constrained to have a high score of the pose estimator [42]. At the same time, the
motion of joints in a pose sequence is constrained to be consistent with the optical
flow extracted at joint positions. In contrast to [8] we do not perform limb
recombination. See Figure 2 for examples of automatically extracted human poses.
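
Linking per-frame pose candidates with dynamic programming can be sketched as a Viterbi-style recursion. This is only an outline of the general technique, not the estimator used in the experiments: `unary` stands for the pose-estimator score of each candidate, and `pairwise` is a hypothetical penalty that would measure disagreement between joint motion and optical flow.

```python
def link_poses(unary, pairwise):
    """Select one candidate per frame maximizing score minus transition penalties.

    unary[t][i]       : score of candidate i in frame t (higher is better)
    pairwise(t, i, j) : penalty for going from candidate i in frame t-1
                        to candidate j in frame t (hypothetical cost function)
    Returns the list of selected candidate indices, one per frame.
    """
    best = [list(unary[0])]   # best[t][j]: best total score of a path ending in j
    back = []                 # back[t-1][j]: predecessor of j on that path
    for t in range(1, len(unary)):
        scores, ptrs = [], []
        for j, u in enumerate(unary[t]):
            cands = [best[-1][i] - pairwise(t, i, j)
                     for i in range(len(unary[t - 1]))]
            i_best = max(range(len(cands)), key=cands.__getitem__)
            scores.append(cands[i_best] + u)
            ptrs.append(i_best)
        best.append(scores)
        back.append(ptrs)
    # Backtrack from the best final candidate.
    j = max(range(len(best[-1])), key=best[-1].__getitem__)
    path = [j]
    for ptrs in reversed(back):
        j = ptrs[j]
        path.append(j)
    return path[::-1]
```

The cost is linear in the number of frames and quadratic in the number of candidates per frame, which is what makes a large candidate set per frame tractable.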

4.2. High-level pose features (HLPF) 

High-level pose features (HLPF) encode spatial and temporal relations of body joint
positions and were introduced in [19]. Given a sequence of human poses P, positions of
body joints are first normalized with respect to the person size. Then, the relative
offsets to the head are computed for each pose in P. We have observed that the head is
more reliable than the torso used in [19]. Static features are then the distances
between all pairs of joints, orientations of the vectors connecting pairs of joints and
inner angles spanned by vectors connecting all triplets of joints.
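
The three kinds of static features listed above can be illustrated with a short sketch. It assumes joints are already normalized and expressed relative to the head, and it measures the inner angle at the middle joint of each triplet; that choice of vertex, like the function name, is an assumption made for illustration and is not taken from the HLPF code.

```python
import math
from itertools import combinations


def static_pose_features(joints):
    """Distances, orientations and inner angles for a list of (x, y) joints."""
    dists, orients = [], []
    for (xa, ya), (xb, yb) in combinations(joints, 2):
        dists.append(math.hypot(xb - xa, yb - ya))
        orients.append(math.atan2(yb - ya, xb - xa))

    angles = []
    for (xa, ya), (xb, yb), (xc, yc) in combinations(joints, 3):
        # Vectors from the middle joint b to the two other joints.
        ux, uy = xa - xb, ya - yb
        vx, vy = xc - xb, yc - yb
        nu, nv = math.hypot(ux, uy), math.hypot(vx, vy)
        if nu == 0.0 or nv == 0.0:
            angles.append(0.0)  # degenerate triplet with coincident joints
            continue
        cos_angle = (ux * vx + uy * vy) / (nu * nv)
        angles.append(math.acos(max(-1.0, min(1.0, cos_angle))))
    return dists, orients, angles
```

For n joints this yields n(n-1)/2 distances and orientations and n(n-1)(n-2)/6 angles per frame.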

Figure 2: Illustration of human pose estimation used in our experiments [8]. Successful
examples and failure cases on JHMDB (left two images) and on MPII Cooking Activities
(right two images). Only left and right arms are displayed for clarity.


Dynamic features are obtained from trajectories of body joints. HLPF combines temporal
differences of some of the static features, i.e., differences in distances between
pairs of joints, differences in orientations of lines connecting joint pairs and
differences in inner angles. Furthermore, translations of joint positions ($dx$ and
$dy$) and their orientations ($\arctan(dy/dx)$) are added.

All features are quantized using a separate codebook for each feature dimension
(descriptor type), constructed using k-means with k = 20. A video sequence is then
represented by a histogram of quantized features and the training is performed using an
SVM with a χ²-kernel. More details can be found in [19]. To compute HLPF features we
use the publicly available code with minor modifications, i.e., we consider the head
instead of the torso center for relative positions. We have also found that converting
angles, originally in degrees, to radians and L2-normalizing the HLPF features improves
the performance.

4.3. Dense trajectory features 

Dense Trajectories (DT) 13^ are local video descriptors that have recently shown
excellent performance in several action recognition benchmarks (251001. The method
first densely samples points which are tracked using optical flow ca. For each
trajectory, 4 descriptors are computed in the aligned spatio-temporal volume: HOG (H,
HOF [22] and MBH [10]. A recent approach [41] removes trajectories consistent with the
camera motion (estimated by computing a homography using optical flow and SURF Q point
matches and RANSAC flEl ). Flow descriptors are then computed from optical flow warped
according to the estimated homography. We use the publicly available implementation
[41] to compute the improved version of DT (IDT).

Fisher Vectors (FV) [27] encoding has been shown to outperform the bag-of-words
approach 15) resulting in state-of-the-art performance for action recognition in
combination with DT features [25]. FV relies on a Gaussian mixture model (GMM) with K
Gaussian components, computing first and second order statistics with respect to the
GMM. FV encoding is performed separately for the 4 different IDT descriptors (their
dimensionality is reduced by the factor of 2 using PCA). Following Ea , the performance
is improved by passing FV through signed square-rooting and L2 normalization. As in
[25] we use a spatial pyramid representation and a number of K = 256 Gaussian
components. FV encoding is performed using the Yael library ca and classification is
performed with a linear SVM.
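
Signed square-rooting followed by L2 normalization is a short post-processing step. The sketch below shows it on a plain list of floats; the function name is illustrative and the code is independent of the Yael library.

```python
import math


def power_l2_normalize(v):
    """Apply sign(x) * sqrt(|x|) element-wise, then scale to unit L2 norm."""
    rooted = [math.copysign(math.sqrt(abs(x)), x) for x in v]
    norm = math.sqrt(sum(x * x for x in rooted))
    if norm == 0.0:
        return rooted  # all-zero vector: nothing to scale
    return [x / norm for x in rooted]
```

The square root compresses large components so that a few bursty dimensions do not dominate the linear SVM.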

5. Datasets 

In our experiments we use two datasets, JHMDB [19] and MPII Cooking Activities [29], as
well as two subsets of these datasets, sub-JHMDB and sub-MPII Cooking. We present them
in the following.

JHMDB [19] is a subset of HMDB [21], see Figure 2 (left). It contains 21 human actions,
such as brush hair, climb, golf, run or sit. Video clips are restricted to the duration
of the action. There are between 36 and 55 clips per action for a total of 928 clips.
Each clip contains between 15 and 40 frames of size 320 × 240. Human pose is annotated
in each of the 31,838 frames. There are 3 train/test splits for the JHMDB dataset and
evaluation averages the results over these three splits. The metric used is accuracy:
each clip is assigned an action label corresponding to the maximum value among the
scores returned by the action classifiers.
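
The evaluation protocol described here (label by maximum classifier score, accuracy per split, mean over splits) can be written down directly. This is a small sketch with hypothetical function names; scores are assumed to be given as one dictionary per clip mapping action labels to classifier outputs.

```python
def predict(scores):
    """Return the label whose classifier score is highest for one clip."""
    return max(scores, key=scores.get)


def accuracy(score_dicts, labels):
    """Fraction of clips whose predicted label matches the ground truth."""
    correct = sum(predict(s) == y for s, y in zip(score_dicts, labels))
    return correct / len(labels)


def mean_over_splits(accuracies):
    """Average the per-split accuracies (three splits on JHMDB)."""
    return sum(accuracies) / len(accuracies)
```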

In our experiments we also use a subset of JHMDB, referred to as sub-JHMDB [19]. This
subset includes 316 clips distributed over 12 actions in which the human body is fully
visible. Again there are 3 train/test splits and the evaluation metric is accuracy.

MPII Cooking Activities [29] contains 64 fine-grained actions and an additional
background class, see Figure 2 (right). Actions take place in a kitchen with static
background. There are 5609 action clips of frame size 1624 × 1224. Some actions are
very similar, such as cut dice, cut slices, and cut stripes or wash hands and wash
objects. Thus, these activities are qualified as "fine-grained". There are 7 train/test
splits and the evaluation is reported in terms of mean Average Precision (mAP) using
the code provided with the dataset.

                       JHMDB-GT                 MPII Cooking-Pose [8]
Parts          App     OF      App + OF      App     OF      App + OF
Hands          46.3    54.9    57.9          39.9    46.9    51.9
Upper body     52.8    60.9    67.1          32.3    47.6    50.1
Full body      52.2    61.6    66.1          -       -       -
Full image     43.3    55.7    61.0          28.8    56.2    56.5
All            60.4    69.1    73.4          43.6    57.4    60.8

Table 1: Performance of appearance-based (App) and flow-based (OF) P-CNN features.
Results are obtained with max-aggregation for JHMDB-GT (% accuracy) and MPII Cooking
Activities-Pose [8] (% mAP).


We have also defined a subset of MPII Cooking, referred to as sub-MPII Cooking, with
classes wash hands and wash objects. We have selected these two classes as they are
visually very similar and differ mainly in manipulated objects. To analyze the
classification performance for these two classes in detail, we have annotated human
pose in all frames of sub-MPII Cooking. There are 55 and 139 clips for wash hands and
wash objects actions respectively, for a total of 29,997 frames.

6. Experimental results

This section describes our experimental results and examines the effect of different
design choices. First, we evaluate the complementarity of different human parts in
Section 6.1. We then compare different variants for aggregating CNN features in
Section 6.2. Next, we analyze the robustness of our features to errors in the estimated
pose and their ability to classify fine-grained actions in Section 6.3. Finally, we
compare our features to the state of the art and show that they are complementary to
the popular dense trajectory features in Section 6.4.

6.1. Performance of human part features

Table 1 compares the performance of human part CNN features for both appearance and
flow on JHMDB-GT (the JHMDB dataset with ground-truth pose) and MPII Cooking-Pose [8]
(the MPII Cooking dataset with pose estimated by [8]). Note that for MPII Cooking we
detect upper-body poses only since full bodies are not visible in most of the frames in
this dataset.

Conclusions for both datasets are similar. We can observe that all human parts (hands,
upper body, full body) as well as the full image have similar performance and that
their combination improves the performance significantly. Removing one part at a time
from this combination results in a drop of performance (results not shown here). We
therefore use all pose parts together with the full image descriptor in the following
evaluation. We can also observe that flow descriptors consistently outperform
appearance descriptors by a significant margin for all parts as well as for the overall
combination All. Furthermore, we can observe that the combination of appearance and
flow further improves the performance for all parts including their combination All.
This is the pose representation used in the rest of the evaluation.

In this section, we have applied the max-aggregation (see Section 3) for aggregating
features extracted for individual frames into a video descriptor. Different aggregation
schemes will be compared in the next section.

6.2. Aggregating P-CNN features

CNN features $f_t$ are first extracted for each frame and the following temporal
aggregation pools feature values for each feature dimension over time (see Figure 1).
Results of max-aggregation for JHMDB-GT are reported in Table 1 and compared with other
aggregation schemes in Table 2. Table 2 shows the impact of adding min-aggregation
(Max/Min-aggr) and the first-order difference between CNN features (All-Dyn). Combining
per-frame CNN features and their first-order differences using max- and min-aggregation
further improves results. Overall, we obtain the best results with
All-(Static+Dyn)(Max/Min-aggr) for App + OF, i.e., 74.6% accuracy on JHMDB-GT. This
represents an improvement over Max-aggr by 1.2%. On MPII Cooking-Pose [8] this version
of P-CNN achieves 62.3% mAP (as reported in Table 4), leading to a 1.5% improvement
over max-aggregation reported in Table 1.


Aggregation scheme                   App     OF      App + OF
All(Max-aggr)                        60.4    69.1    73.4
All(Max/Min-aggr)                    60.6    68.9    73.1
All(Static+Dyn)(Max-aggr)            62.4    70.6    74.1
All(Static+Dyn)(Max/Min-aggr)        62.5    70.2    74.6
All(Mean-aggr)                       57.5    69.0    69.4


Table 2: Comparison of different aggregation schemes: Max-, Mean-, and
Max/Min-aggregations as well as adding first-order differences (Dyn). Results are given
for appearance (App), optical flow (OF) and App + OF on JHMDB-GT (% accuracy).

We have also experimented with second-order differences and other statistics, such as
mean-aggregation (last row in Table 2), but this did not improve results. Furthermore,
we have tried temporal aggregation of classification scores obtained for individual
frames. This led to a decrease of performance, e.g. for All (App) on JHMDB-GT
score-max-aggregation results in 56.1% accuracy, compared to 60.4% for
features-max-aggregation (top row, left column in Table 2). This indicates that early
aggregation works significantly better in our setting.

In summary, the best performance is obtained for Max-aggr on single-frame features, if
only one aggregation scheme is used. Addition of Min-aggr and first-order differences
Dyn provides further improvement. In the remaining evaluation we report results for
this version of P-CNN, i.e., All parts App+OF with (Static+Dyn)(Max/Min-aggr).

6.3. Robustness of pose-based features 

This section examines the robustness of P-CNN features in the presence of pose
estimation errors and compares results with the state-of-the-art pose features HLPF
[19]. We report results using the code of [19] with minor modifications described in
Section 4.2. Our HLPF results are comparable to [19] in general and are slightly better
on JHMDB-GT (77.8% vs. 76.0%). Table 3 evaluates the impact of automatic pose
estimation versus ground-truth pose (GT) for sub-JHMDB and JHMDB. We can observe that
results for GT pose are comparable on both datasets and for both types of pose
features. However, P-CNN is significantly more robust to errors in pose estimation. For
automatically estimated poses P-CNN drops only by 5.7% on sub-JHMDB and by 13.5% on
JHMDB, whereas HLPF drops by 27.1% and 52.5% respectively. For both descriptors the
drop is less significant on sub-JHMDB, as this subset only contains full human poses
for which pose is easier to estimate. Overall the performance of P-CNN features for
automatically extracted poses is excellent and outperforms HLPF by a very large margin
(+35.8%) on JHMDB.


                   sub-JHMDB                        JHMDB
           GT      Pose [42]   Diff        GT      Pose [8]    Diff
P-CNN      72.5    66.8        5.7         74.6    61.1        13.5
HLPF       78.2    51.1        27.1        77.8    25.3        52.5


Table 3: Impact of automatic pose estimation versus ground-truth pose (GT) for P-CNN
features and HLPF [19]. Results are presented for sub-JHMDB and JHMDB (% accuracy).

We now compare and evaluate the robustness of P-CNN and HLPF features on the MPII
Cooking dataset. To evaluate the impact of ground-truth pose (GT), we have manually
annotated two classes, "wash hands" and "wash objects", referred to as sub-MPII
Cooking. Table 4 compares P-CNN and HLPF for sub-MPII and MPII Cooking. In all cases
P-CNN outperforms HLPF significantly. Interestingly, even for ground-truth poses P-CNN
performs significantly better than HLPF. This could be explained by the better encoding
of image appearance by P-CNN features, especially for object-centered actions such as
"wash hands" and "wash objects". We can also observe that the drop due to errors in
pose estimation is similar for P-CNN and HLPF. This might be explained by the fact that
CNN hand features are quite sensitive to the pose estimation. However, P-CNN still
significantly outperforms HLPF for automatic pose. In particular, there is a
significant gain of +29.7% for the full MPII Cooking dataset.

               sub-MPII Cooking              MPII Cooking
           GT      Pose [8]    Diff          Pose [8]
P-CNN      83.6    67.5        16.1          62.3
HLPF       76.2    57.4        18.8          32.6

Table 4: Impact of automatic pose estimation versus ground-truth pose (GT) for P-CNN
features and HLPF [19]. Results are presented for sub-MPII Cooking and MPII Cooking
(% mAP).

6.4. Comparison to the state of the art 

In this section we compare to state-of-the-art dense trajectory features [41] encoded
by Fisher vectors [25] (IDT-FV), briefly described in Section 4.3. We use the online
available code, which we validated on Hollywood2 (65.3% versus 64.3% [41]).
Furthermore, we show that our pose features P-CNN and IDT-FV are complementary and
compare to other state-of-the-art approaches on JHMDB and MPII Cooking.

Table 5 shows that for ground-truth poses our P-CNN features outperform
state-of-the-art IDT-FV descriptors significantly (8.7%). If the pose is extracted
automatically both methods are on par. Furthermore, in all cases the combination of
P-CNN and IDT-FV obtained by late fusion of the individual classification scores
significantly increases the performance over using individual features only. Figure 3
illustrates per-class results for P-CNN and IDT-FV on JHMDB-GT.
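
Late fusion of classification scores can be sketched as a weighted sum of per-class scores from the two classifiers. The text does not specify the fusion rule, so the equal weighting below, like the function name, is an assumption chosen for illustration.

```python
def late_fusion(scores_a, scores_b, weight=0.5):
    """Combine per-class scores of two classifiers by a weighted sum.

    Assumption: both dictionaries share the same class labels and their
    scores are on comparable scales; the actual rule used is not stated.
    """
    return {c: weight * scores_a[c] + (1.0 - weight) * scores_b[c]
            for c in scores_a}


def fused_label(scores_a, scores_b, weight=0.5):
    """Predicted class after fusion: the label with the highest fused score."""
    fused = late_fusion(scores_a, scores_b, weight)
    return max(fused, key=fused.get)
```

Fusing at the score level lets each descriptor keep its own classifier, so a clip that is ambiguous for one representation can still be decided by the other.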

Table 6 compares our results to other methods on MPII Cooking. Our approach outperforms
the state of the art on this dataset and is on par with the recently published work of
[44]. We have compared our method with HLPF [19] on JHMDB in the previous section.
P-CNN performs on par with HLPF for GT poses and significantly outperforms HLPF for
automatically estimated poses. Combination of P-CNN with IDT-FV improves the
performance to 79.5% and 72.2% for GT and automatically estimated poses respectively
(see Table 5). This outperforms the state-of-the-art result reported in [19].

                          JHMDB               MPII Cook.
Method              GT      Pose [8]          Pose [8]
P-CNN               74.6    61.1              62.3
IDT-FV              65.9    65.9              67.6
P-CNN + IDT-FV      79.5    72.2              71.4

Table 5: Comparison to IDT-FV on JHMDB (% accuracy) and MPII Cooking Activities (% mAP)
for ground-truth (GT) and pose [8]. The combination of P-CNN + IDT-FV performs best.

Figure 3: Per class accuracy on JHMDB-GT for P-CNN (green) and IDT-FV (red) methods.
Values correspond to the difference in accuracy between P-CNN and IDT-FV (positive
values indicate better performance of P-CNN).

Qualitative results comparing P-CNN and IDT-FV are presented in Figure 4 for JHMDB-GT.
See Figure 3 for the quantitative comparison. To highlight improvements achieved by the
proposed P-CNN descriptor, we show results for classes with a large improvement of
P-CNN over IDT-FV, such as shoot_gun, wave, throw and jump as well as for a class with
a significant drop, namely kick_ball. Figure 4 shows two examples for each selected
action class with the maximum difference in ranks obtained by P-CNN (green) and IDT-FV
(red). For example, the most significant improvement (Figure 4, top left) increases the
sample ranking from 211 to 23, when replacing IDT-FV by P-CNN. In particular, the
shoot_gun and wave classes involve small localized motion, making classification
difficult for IDT-FV while P-CNN benefits from the local human body part information.
Similarly, the two samples from the action class throw also seem to have restricted and
localized motion while the action jump is very short in time. In the case of kick_ball
the significant decrease can be explained by the important dynamics of this action,
which is better captured by IDT-FV features. Note that P-CNN only captures motion
information between two consecutive frames.

Method                            MPII Cook.
Holistic Pose [29]                57.9
Semantic Features [45]            70.5
Interaction Part Mining [44]      72.4
P-CNN + IDT-FV (ours)             71.4

Table 6: State of the art on the MPII Cooking (% mAP).

Figure 5 presents qualitative results for MPII Cooking-Pose [8] showing samples with
the maximum difference in ranks over all classes.

7. Conclusion 

This paper introduces pose-based convolutional neural network features (P-CNN).
Appearance and flow information is extracted at characteristic positions obtained from
human pose and aggregated over frames of a video. Our P-CNN description is shown to be
significantly more robust to errors in human pose estimation compared to existing
pose-based features such as HLPF [19]. In particular, P-CNN significantly outperforms
HLPF on the task of fine-grained action recognition in the MPII Cooking Activities
dataset. Furthermore, P-CNN features are complementary to the dense trajectory features
and significantly improve the current state of the art for action recognition when
combined with IDT-FV.

Our study confirms conclusions in [19], namely, that correct estimation of human poses
leads to significant improvements in action recognition. This implies that pose is
crucial to capture discriminative information of human actions. Pose-based action
recognition methods have a promising future due to the recent progress in pose
estimation, notably using CNNs [7]. An interesting direction for future work is to
adapt CNNs for each P-CNN part (hands, upper body, etc.) by fine-tuning networks for
corresponding image areas. Another promising direction is to model temporal evolution
of frames using RNNs [12, 30].

Figure 4: Results on JHMDB-GT (split 1). Each column corresponds to an action class.
Video frames on the left (green) illustrate two test samples per action with the
largest improvement in ranking when using P-CNN (rank in green) and IDT-FV (rank in
red). Examples on the right (red) illustrate samples with the largest decreases in the
ranking. Actions with large differences in performance are selected according to
Figure 3. Each video sample is represented by its middle frame.



Figure 5: Results on MPII Cooking-Pose [8] (split 1). Examples on the left (green) show
the 8 best ranking improvements (over all classes) obtained by using P-CNN (rank in
green) instead of IDT-FV (rank in red). Examples on the right (red) illustrate video
samples with the largest decrease in the ranking. Each video sample is represented by
its middle frame.